Fused Cross Entropy Loss #1601
Conversation
mgmalek's implementation might have numerical issues when using half-precision dtypes.
Haha, yeah, that was probably me reporting the loss-curve differences on Twitter. Do you have any insights on the best way to fix the kernels? Standard CEL uses bfloat16 as well, IIRC.
The problem is the lm_head weight (proj_weight), not CEL. By default, bf16 matmul uses an fp32 accumulator (along the GEMM K axis). Since the matmul is performed tile by tile along the M and N axes (an MxK @ KxN matmul), registers serve as the accumulator, with no need for an fp32 global-memory buffer. Here, however, we chunk along the GEMM K axis (a DxT @ TxV matmul), so the partial products from each chunk have to be accumulated in a global-memory buffer, and that cross-chunk accumulation happens in the buffer's dtype rather than in fp32 registers. A simple fix is to use an fp32 grad_proj_weight, but that incurs a significant memory overhead. Can you check the loss curve with an fp32 grad_proj_weight?
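To make the accumulation point concrete, here is a minimal sketch (not this PR's kernel; the shapes and chunk count are made up) of how chunking the DxT @ TxV matmul along the token axis moves the accumulation out of the GEMM and into whatever dtype the output buffer uses:

```python
import torch

torch.manual_seed(0)
D, T, V, n_chunks = 256, 2048, 4096, 8

hidden = torch.randn(T, D, dtype=torch.bfloat16)       # activations, T x D
grad_logits = torch.randn(T, V, dtype=torch.bfloat16)  # dL/dlogits, T x V

# Unchunked D x V reference, computed in fp32 for comparison.
ref = hidden.float().t() @ grad_logits.float()

# Chunked along T: each partial D x V product lands in a global buffer,
# so the cross-chunk sum happens in that buffer's dtype. (Illustrative on
# CPU; on GPU the unchunked bf16 GEMM would accumulate in fp32 registers.)
acc_bf16 = torch.zeros(D, V, dtype=torch.bfloat16)     # rounds after every chunk
acc_fp32 = torch.zeros(D, V, dtype=torch.float32)      # the fp32 grad_proj_weight fix
for h, g in zip(hidden.chunk(n_chunks), grad_logits.chunk(n_chunks)):
    partial = h.t() @ g                                # bf16 partial product
    acc_bf16 += partial
    acc_fp32 += partial.float()

print("bf16 accumulator error:", (acc_bf16.float() - ref).abs().max().item())
print("fp32 accumulator error:", (acc_fp32 - ref).abs().max().item())
```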
Gave your recommendation a try, but changing grad_proj_weight to fp32 didn't seem to help much.
Uh oh, now we need more investigation…
It seems that the fused version has a different loss and grad_norm from the very first step? That looks strange…
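One way to chase a step-one mismatch is a standalone parity check against the unfused path on identical inputs; if loss and grad_norm already disagree there, the issue is in the fused forward/backward itself rather than anything accumulated over training. A hedged sketch (`fused_cross_entropy` is a hypothetical stand-in for whichever entry point this PR exposes):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, D, V = 512, 256, 1024
hidden = torch.randn(T, D, requires_grad=True)
proj_weight = torch.randn(V, D, requires_grad=True)
labels = torch.randint(0, V, (T,))

# Reference path: materialize logits, then standard cross entropy.
ref_loss = F.cross_entropy(hidden @ proj_weight.t(), labels)
ref_loss.backward()
ref_grad_norm = proj_weight.grad.norm().item()

# Fused path on cloned copies of the same inputs (hypothetical entry point):
# hidden2 = hidden.detach().clone().requires_grad_()
# w2 = proj_weight.detach().clone().requires_grad_()
# fused_loss = fused_cross_entropy(hidden2, w2, labels)
# fused_loss.backward()
# Compare fused_loss vs ref_loss and w2.grad.norm() vs ref_grad_norm;
# a mismatch here reproduces the step-one discrepancy in isolation.
print(ref_loss.item(), ref_grad_norm)
```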
Closing now that Liger is available.
pytorch/pytorch#124480 served as the impetus for getting this integrated.
Adapted the code from @mgmalek's https://github.com/mgmalek/efficient_cross_entropy/blob/main/modules.py so that it properly handles causal/next-token prediction.
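As a rough sketch of what that adaptation looks like (assuming the usual convention that position t predicts token t+1; this illustrates only the label shift and the chunked loop, not the fused backward that provides the actual memory savings):

```python
import torch
import torch.nn.functional as F

def causal_chunked_ce(hidden, proj_weight, input_ids, n_chunks=4):
    """hidden: (B, T, D), proj_weight: (V, D), input_ids: (B, T)."""
    D = hidden.size(-1)
    # Next-token prediction: position t's hidden state is scored against
    # token t+1, so drop the last position and the first label.
    shift_hidden = hidden[:, :-1, :].reshape(-1, D)
    shift_labels = input_ids[:, 1:].reshape(-1)
    total = hidden.new_zeros((), dtype=torch.float32)
    for h, y in zip(shift_hidden.chunk(n_chunks), shift_labels.chunk(n_chunks)):
        logits = h @ proj_weight.t()  # only one (chunk, V) logits tile at a time
        total = total + F.cross_entropy(logits.float(), y, reduction="sum")
    return total / shift_labels.numel()
```

A fused implementation additionally avoids keeping each chunk's logits alive for the backward pass, which is where the real memory savings over plain autograd come from.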